
Feat: Enhance multimodality #1070

Merged 18 commits into instructor-ai:main on Oct 20, 2024

Conversation

@arcaputo3 (Contributor) commented Oct 11, 2024

Summary:

I think `multimodal.py` is a huge step in the right direction for Instructor, especially as vision becomes more mainstream. Since every provider handles images uniquely, the hope is that the interface will only need to define `to_<provider>` methods, and the `Image` class can handle serialization and routing as needed.

Though experimental, I propose image autodetection to streamline vision use cases. Whether image data comes from a path, a URL, or a base64-encoded string, it should almost always be possible to infer the type automatically from the source string (barring edge cases, where a user can be more explicit if needed). This allows users to specify content as a list of strings, and Instructor handles the rest when `autodetect_images=True` is set.
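As a rough illustration of the idea, the detection could look like the sketch below. The function name and exact ordering are hypothetical, not instructor's actual `Image.autodetect`; note that checking for a URL before a file path avoids misclassifying a URL that happens to coincide with a local filename.

```python
import base64
import binascii
from pathlib import Path
from urllib.parse import urlparse


def detect_image_source(source: str) -> str:
    """Classify a string as 'url', 'path', or 'base64'.

    Simplified sketch of the autodetection proposal; the real
    implementation in instructor handles more edge cases.
    """
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    if Path(source).is_file():
        return "path"
    try:
        # validate=True rejects characters outside the base64 alphabet
        base64.b64decode(source, validate=True)
        return "base64"
    except (binascii.Error, ValueError):
        raise ValueError(f"could not autodetect image source: {source!r}")
```

A user can still be explicit (e.g. construct the image from a known path) when a string is genuinely ambiguous.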

Change notes:

  • Upgrade Image to support all valid OpenAI and Anthropic vision mime types
  • Use mime type detection rather than extensions
  • Add image autodetection for ease of use

Important

Enhance Image class to support autodetection of image sources and update tests and documentation accordingly.

  • Behavior:
    • Enhance Image class in multimodal.py to support autodetection of image sources from URLs, file paths, or base64 strings.
    • Add autodetect_images parameter to convert_messages() in multimodal.py to enable automatic image detection.
  • Tests:
    • Add tests in test_multimodal.py, test_anthropic/test_multimodal.py, and test_openai/test_multimodal.py for Image autodetection and conversion functions.
    • Add fixtures base64_jpeg and base64_png in test_multimodal.py for testing base64 image handling.
  • Documentation:
    • Update multimodal.md to include usage of autodetect_images parameter in examples.
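The per-provider serialization the change notes describe can be sketched as follows. The dict shapes follow the public OpenAI and Anthropic vision message formats; the function names are illustrative stand-ins for the `to_<provider>` methods, not instructor's actual API.

```python
def to_openai_image(url_or_data_uri: str) -> dict:
    # OpenAI's vision API accepts both https URLs and data: URIs here.
    return {"type": "image_url", "image_url": {"url": url_or_data_uri}}


def to_anthropic_image(media_type: str, base64_data: str) -> dict:
    # Anthropic expects inline base64 with an explicit media type.
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": base64_data},
    }
```

Keeping serialization on the `Image` class means new providers only need one more `to_<provider>` method.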

This description was created by Ellipsis for 82c6846. It will automatically update as commits are pushed.

@ellipsis-dev (bot) left a comment

👍 Looks good to me! Reviewed everything up to 82c6846 in 35 seconds

More details
  • Looked at 632 lines of code in 7 files
  • Skipped 0 files when reviewing.
  • Skipped posting 11 drafted comments based on config settings.
1. instructor/multimodal.py:42
  • Draft comment:
    Check URL validity before using Path(source).is_file() to avoid false positives when the string is a valid URL.
  • Reason this comment was not posted:
    Confidence changes required: 80%
    The autodetect method in Image class uses Path(source).is_file() to check if a string is a file path. This can lead to incorrect results if the string is a valid URL that coincidentally matches a file path on the system. It's better to check URL validity first.
2. instructor/multimodal.py:102
  • Draft comment:
    Handle cases where requests.head does not return a Content-Type header or returns an incorrect one.
  • Reason this comment was not posted:
    Confidence changes required: 70%
    The from_url method in the Image class uses requests.head to determine the MIME type if mimetypes.guess_type fails. This is a good fallback, but it should handle cases where the Content-Type header is missing or incorrect.
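A sketch of the hardened fallback this comment suggests, guarding against a missing or non-image `Content-Type` header. It is shown with the stdlib for self-containment; the code path under review uses `requests.head`, and the helper name is illustrative.

```python
import mimetypes
from urllib.request import Request, urlopen


def guess_media_type(url: str) -> str:
    # Cheap, offline guess from the URL's extension first.
    media_type, _ = mimetypes.guess_type(url)
    if media_type is None:
        # Fallback: ask the server via HEAD, but do not trust that a
        # Content-Type header is present or correct.
        with urlopen(Request(url, method="HEAD"), timeout=10) as resp:
            media_type = resp.headers.get("Content-Type", "").split(";")[0].strip()
    if not media_type.startswith("image/"):
        raise ValueError(f"could not determine an image media type for {url}")
    return media_type
```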
3. instructor/multimodal.py:64
  • Draft comment:
    Consider logging a warning in autodetect_safely when autodetection fails and the source is returned as a string.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The autodetect_safely method in the Image class returns the source as a string if autodetection fails. This is a good fallback, but it should log a warning to inform the user that autodetection failed.
4. instructor/multimodal.py:49
  • Draft comment:
    Include the source value in the ValueError message in autodetect for better debugging information.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The autodetect method in the Image class raises a ValueError if the source is not detected as a valid image. This is appropriate, but the error message could be more informative by including the source value.
5. instructor/multimodal.py:90
  • Draft comment:
    Include the exception message in the ValueError raised in from_raw_base64 for better debugging information.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The from_raw_base64 method in the Image class raises a ValueError with a generic message if an exception occurs. It would be more informative to include the exception message in the error message.
6. docs/concepts/multimodal.md:32
  • Draft comment:
    If this is a new md file, ensure it is added to mkdocs.yml for documentation generation.
  • Reason this comment was not posted:
    Confidence changes required: 80%
    The PR introduces new functionality related to image handling, including autodetection of image types. This is a significant change that should be documented and tested. The PR includes changes to the multimodal.py file, which is part of the library code. Therefore, we need to ensure that the documentation and tests are updated accordingly.
7. instructor/multimodal.py:27
  • Draft comment:
    The autodetect method is quite complex. Consider adding comments to explain the logic, especially the decision-making process for determining the image type.
  • Reason this comment was not posted:
    Confidence changes required: 80%
8. instructor/multimodal.py:67
  • Draft comment:
    The is_base64 method is used multiple times. Consider adding a comment to explain its purpose and the regex used.
  • Reason this comment was not posted:
    Confidence changes required: 80%
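The kind of regex-based check the comment refers to might look like the following. This is an assumed shape, not the actual `is_base64` in the PR: the full base64 alphabet, optional `=` padding, and a length that is a multiple of four.

```python
import re

# Base64 alphabet with up to two '=' padding characters at the end.
_BASE64_RE = re.compile(r"^[A-Za-z0-9+/]+={0,2}$")


def is_base64(s: str) -> bool:
    # Valid base64 payloads always have a length divisible by 4.
    return len(s) % 4 == 0 and bool(_BASE64_RE.fullmatch(s))
```

A character-class check like this is cheap, but it can accept short strings that merely look like base64, which is exactly why a comment explaining the trade-off is worthwhile.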
9. instructor/multimodal.py:93
  • Draft comment:
    The from_url method is quite complex. Consider adding comments to explain the logic, especially the decision-making process for determining the media type.
  • Reason this comment was not posted:
    Confidence changes required: 80%
10. instructor/multimodal.py:112
  • Draft comment:
    The from_path method is quite complex. Consider adding comments to explain the logic, especially the decision-making process for determining the media type and handling file errors.
  • Reason this comment was not posted:
    Confidence changes required: 80%
11. instructor/process_response.py:667
  • Draft comment:
    Ensure that the convert_messages function is tested for the new autodetect_images parameter.
  • Reason this comment was not posted:
    Confidence changes required: 80%


@arcaputo3 arcaputo3 changed the title Enhance multimodality Feat: Enhance multimodality Oct 11, 2024
@jxnl (Collaborator) commented Oct 12, 2024

this looks great but mind cleaning up the diff

@arcaputo3 (Contributor, Author) replied:

> this looks great but mind cleaning up the diff

yes, I ruffed the whole repo by mistake lol. fixed now

@arcaputo3 (Contributor, Author) commented:

Ideally I'd like to be able to support this feature alongside Anthropic prompt caching more natively, as currently that would require some gymnastics. My thought is that we could support a generic content dict such as

{"type": "image", "source": <generic_str_source>}

and that way users could specify the "cache_control" key if they like
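Concretely, with Anthropic's prompt caching, such a content block could carry the `cache_control` key directly. This is a sketch of the proposal (and a placeholder payload), not instructor's final API:

```python
# An Anthropic image content block with prompt caching enabled.
# The base64 payload here is a placeholder.
image_block = {
    "type": "image",
    "source": {
        "type": "base64",
        "media_type": "image/png",
        "data": "<base64-encoded-bytes>",
    },
    # Anthropic's prompt-caching marker; users opt in per block.
    "cache_control": {"type": "ephemeral"},
}
```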

- Optionally cache expensive operations
- Pyright and remove double media type access for Anthropic
@jxnl (Collaborator) commented Oct 12, 2024

oh good idea on the cache key. as long as its well documented im +1

arcaputo3 and others added 5 commits October 14, 2024 09:55
- Use Anthropic `cache_control` syntax and update LRU cache to handle mapping types
- Test cache hits and enable / disable
@arcaputo3 (Contributor, Author) commented:

@jxnl let me know what you think. I am not actively using Gemini, but it would be great if we could add Google support at a later date.

@jxnl (Collaborator) commented Oct 16, 2024

why not use functools.lru_cache looks like a lot of extra code to implement the cache
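The suggestion works because `functools.lru_cache` memoizes by argument value; a minimal illustration (the cached function here is a stand-in for an expensive download-and-encode step, not the PR's actual code):

```python
from functools import lru_cache

call_count = 0


@lru_cache(maxsize=128)
def encode_image(source: str) -> str:
    """Stand-in for an expensive download-and-base64-encode step."""
    global call_count
    call_count += 1
    return source[::-1]  # placeholder transformation


encode_image("cat.png")
encode_image("cat.png")  # cache hit: the body runs only once
```

One caveat: `lru_cache` requires hashable arguments, which is why the earlier dict-based `cache_control` handling needed the extra hashing code that this thread is discussing removing.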

arcaputo3 and others added 2 commits October 16, 2024 10:44
- Simplify caching and prompt caching logic
- Remove cache config feature for now
@arcaputo3 (Contributor, Author) commented:

Okay, I realized that I didn't even need to hash the cache-control logic, and refactored everything to be much more readable.

@jxnl jxnl enabled auto-merge (squash) October 20, 2024 14:58
@jxnl jxnl disabled auto-merge October 20, 2024 15:01
@jxnl jxnl merged commit 59f1d6a into instructor-ai:main Oct 20, 2024
8 of 12 checks passed